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Abstract —Face images appeared in multimedia applications, 
e,g,, social networks and digital entertainment, usually exhibit 
dramatic pose, illumination, and expression variations, resulting 
in considerable performance degradation for traditional face 
recognition algorithms. This paper proposes a comprehensive 
deep learning framework to jointly learn face representation 
using multimodal information. The proposed deep learning struc¬ 
ture is composed of a set of elaborately designed convolutional 
neural networks (CNNs) and a three-layer stacked auto-encoder 
(SAE). The set of CNNs extracts complementary facial features 
from multimodal data. Then, the extracted features are concate¬ 
nated to form a high-dimensional feature vector, whose dimension 
is compressed by SAE. All the CNNs are trained using a subset 
of 9,000 subjects from the publicly available CASIA-WebFace 
database, which ensures the reproducibility of this work. Using 
the proposed single CNN architecture and limited training data, 
98.43% verification rate is achieved on the LFW database. 
Benefited from the complementary information contained in 
multimodal data, our small ensemble system achieves higher than 
99.0% recognition rate on LFW using publicly available training 
set. 

Index Terms —Face recognition, deep learning, convolutional 
neural networks, multimodal system. 


1. Introduction 

F ace recognition has been one of the most extensively 
studied topics in computer vision. The importance of face 
recognition is closely related to its great potential in multime¬ 
dia applications, e.g., photo album management in social net¬ 
works, human machine interaction, and digital entertainment. 
With years of effort, significant progress has been achieved 
for face recognition. However, it remains a challenging task 
for multimedia applications, as observed in recent works (ll, 
lHI- In this paper, we handle the face recognition problem for 
matching internet face images appeared in social networks, 
which is one of the most common applications in multimedia 
circumstances. 

Recognizing the face images appeared in social networks is 
difficult, due to the reasons mainly from the following two per¬ 
spectives. First, the face images uploaded to social networks 
are captured in real-world conditions; therefore faces in these 
images usually exhibit rich variations in pose, illumination, 
expression, and occlusion, as illustrated in Fig. Second, 
face recognition in social networks is a large-scale recognition 
problem due to the numerous face images of potentially large 
amount of users. The prediction accuracy of face recognition 
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Fig. 1. Face images in multimedia applications usually exhibit rich variations 
in pose, illumination, expression, and occlusion. 

algorithms usually degrades dramatically with the increase of 
face identities. 

Accurate face recognition depends on high quality face 
representations. Good face representation should be discrim¬ 
inative to the change of face identify while remains robust 
to intra-personal variations. Conventional face representations 
are built on local descriptors, e.g.. Local Binary Patterns 
(LBP) 01, Local Phase Quantization (LPQ) El, 01, Dual- 
Cross Patterns (DCP) 01, and Binarised Statistical Image 
Features (BSIF) Q. However, the representation composed 
by local descriptors is too shallow to differentiate the com¬ 
plex nonlinear facial appearance variations. To handle this 
problem, recent works turn to Convolutional Neural Networks 
(CNNs) El, m to automatically learn effective features that 
are robust to the nonlinear appearance variation of face images. 
However, the existing works of CNN on face recognition 
extract features from limited modalities, the complementary 
information contained in more modalities is not well studied. 

Inspired by the complementary information contained in 
multi-modalities and the recent progress of deep learning on 
various fields of computer vision, we present a novel face 
representation framework that adopts an ensemble of CNNs 
to leverage the multimodal information. The performance of 
the proposed multimodal system is optimized from two per¬ 
spectives. First, the architecture for single CNN is elaborately 
designed and optimized with extensive experimentations. Sec¬ 
ond, a set of CNNs is designed to extract complementary 
information from multiple modalities, i.e., the holistic face 
image, the rendered frontal face image by 3D model, and 
uniformly sampled face patches. Besides, we design different 
structures for different modalities, i.e., a complex structure is 
designed for the modality that contains the richest information 
while a simple structure is proposed for the modalities with 
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less information. In this way, we strike a balance between 
recognition performance and efficiency. The capacity of each 
modality for face recognition is also compared and discussed. 

We term the proposed deep learning-based face representa¬ 
tion scheme as Multimodal Deep Face Representation (MM- 
DFR), as illustrated in Fig. Under this framework, the face 
representation of one face image involves feature extraction 
using each of the designed CNNs. The extracted features 
are concatenated as the raw feature vector, whose dimension 
is compressed by a three-layer SAE. Extensive experiments 
on the Labeled Face in the Wild (LFW) CO) and CASIA- 
WebFace databases m indicate that superior performance 
is achieved with the proposed MM-DFR framework. Besides, 
the infiuence of several implementation details, e.g., the usage 
strategies of ReLU nonlinearity, multiple modalities, aggres¬ 
sive data augmentation, multi-stage training, and L2 normal¬ 
ization, is compared and discussed in the experimentation 
section. To the best of our knowledge, this is the first published 
approach that achieves higher than 99.0% recognition rate 
using a publicly available training set on the LFW database. 

The remainder of the paper is organized as follows: Section 
II briefiy reviews related works for face recognition and deep 
learning. The proposed MM-DFR face representation scheme 
is illustrated in Section III. Face matching using MM-DFR is 
described in Section IV. Experimental results are presented in 
Section V, leading to conclusions in Section VI. 

II. Related Studies 
A. Face Image Representation 

Popular face representations can be broadly grouped into 
two categories: local descriptor-based representations and deep 
learning-based representations. 

Traditional face representations are based on local descrip¬ 
tors da, ca. Local descriptors can be further divided into 
two groups: the handcrafted descriptors and the learning-based 
descriptors. Among the handcrafted descriptors, Ahonen et 
ah (3) proposed to employ the texture descriptor EBP for 
face representation. EBP works by encoding the gray-value 
difference between each pixel and its neighboring pixels into 
binary codes. Ding et al. ii proposed the Dual-Cross Patterns 
(DCP) descriptor to encode second order statistics along the 
distribution directions of facial components. Other effective 
handcrafted local descriptors include Local Phase Quantiza- 
tion (LPQ) El and Gabor-based descriptors. Representative 
learning-based descriptors include Binarised Statistical Image 
Features (BSIF) Q, (Ml and Discriminant Pace Descriptor 
(DPD) ifTSll . et al. Compared with the handcrafted descriptors, 
the learning-based descriptors usually optimize the pattern 
encoding step using machine learning techniques. An extensive 
and systematic comparison among existing local descriptors 
for face recognition can be found in ||6l; and a detailed 
summarization on local descriptor-based face representations 
can be found in a recent survey (ll . Despite of its ease of use, 
the local descriptor-based approaches have clear limitations: 
the constructed face reprsentation is sensitive to the non-linear 
intra-personal variations, e.g., pose na, expression lEl, and 
illumination ca. In particular, the intra-personal appearance 


change caused by pose variations may substantially surpass 
the difference caused by identities GS). 

The complicated facial appearance variations call for non¬ 
linear techniques for robust face representation, and recent 
progress on deep learning provides an effective tool. In the 
following, we review the most relevant progress for deep 
learning-based face recognition. Taigman et al O proposed 
the DeepEace architecture for face recognition. They use the 
softmax loss, i.e., the face identification loss, as the supervi¬ 
sory signal to train the network and achieve high recognition 
accuracy approaching the human-level. Sun et al proposed 
to combine the identification and verification losses for more 
effective training. They empirically verified that the combined 
supervisory signal is helpful to promote the discriminative 
power of extracted CNN features. Zhou et al ca investigated 
the influence of distribution and size of training data to the 
performance of CNN. With a huge training set composed of 
5 millions of labelled faces, they achieved an accuracy of 
99.5% accuracy on LEW using naive CNN structures. One 
common problem for the above works is that they all employ 
private face databases for training. Due to the distinct size and 
unknown distribution of these private data, the performance 
of the above works may not be directly comparable. Recently, 
Yi et al Eli released the CASIA-WebEace database which 
contains 494,414 labeled images of 10,575 subjects. The 
availability of such a large-scale database enables researchers 
to compete on a fair starting line. In this paper, the training 
of all CNNs are conducted exclusively on a subset of 9,000 
subjects of the CASIA-WebEace database, which ensures the 
reproducibility of this work. The CNN architectures designed 
in this paper are inspired by two previous works CD, im, 
but with a number of modifications and improvements, and our 
designed CNN models have visible advantage in performance. 

B. Multimodal-based Face Recognition 

Most of face recognition algorithms extract a single face 
representation from the face image. However, they are re¬ 
strictive in capturing the diverse information contained in the 
face image. To handle this problem. Ding et al O proposed 
to extract the Multi-directional Multi-level DCPs (MDML- 
DCPs) feature which includes three holistic-level features 
and six component-level features. The set of the nine facial 
features composes the face representation. Similar strategies 
have been adopted in deep learning-based face representations. 
Eor example, the DeepEace approach m adopts the same CNN 
structure to extract facial features from RGB image, gray-level 
image and gradient map. The set of face representations are 
fused in the score level. Sun et al la proposed to extract 
deep features from 25 image patches cropped with various 
scales and positions. The dimension of the concatenated deep 
features is reduced by Principle Component Analysis (PCA). 
Multimodal systems that fuse multiple feature cues are also 
employed in other topics of multimedia and computer vision, 
e.g., visual tracking ( 201 , image classification ( 211 , ( 221 ^ (221 . 
and social media analysis (24l . (251, EH, (221, (28l . 

Our multimodal face recognition system is related to the 
previous approaches, and there is clear novelty. Eirst, we 
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Fig. 2. Flowchart of the proposed multimodal deep face representation (MM-DFR) framework. MM-DFR is essentially composed of two steps: multimodal 
feature extraction using a set of CNNs, and feature-level fusion of the set of CNN features using SAE. 


extract multimodal features from the holistic face image, 
rendered frontal face by 3D face model, and uniformly sam¬ 
pled image patches. The three modalities stand for holistic 
facial features and local facial features, respectively. Different 
from in that employs the 3D model to assist 2D piece-wise 
face warping, we utilize the 3D model to render a frontal 
face in 3D domain, which indicates much stronger alignment 
compared with Different from m that randomly crops 
25 patches over the face image using dense facial feature 
points, we uniformly sample a small number of patches with 
the help of 3D model and sparse facial landmarks, which is 
more reliable compared with dense landmarks. Second, we 
propose to employ SAE to compress the high-dimensional 
deep feature into a compact face signature. Compared with 
the traditional PCA approach for dimension reduction, SAE 
has advantage in learning non-linear feature transformations. 
Third, the large-scale unconstrained face identification prob¬ 
lem has not been well studied due to the lack of appropriate 
face databases. Eortunately, the recently published CASIA- 
WebEace im database provides the possibility for such kind 
of evaluation. In this paper, we evaluate the identification 
performance of MM-DER on the CASIA-WebEace database. 

III. Multimodal Deep Eace Representation 

In this section, we describe the proposed MM-DER frame¬ 
work for face representation. As shown in Eig. MM-DER 
is essentially composed of two steps: multimodal feature 
extraction using a set of CNNs, and feature-level fusion of the 
set of CNN features using SAE. In the following, we describe 
the two main components in detail. 

A. Single CNN Architecture 

All face images employed in this paper are first normalized 
to 230 X 230 pixels with an affine transformation according to 
the coordinates of five sparse facial feature points, /.p., both 
eye centers, the nose tip, and both mouth corners. Sample 
images after the affine transformation are illustrated in Eig. 
We employ an off-the-shelf face alignment tool 1^ for facial 


feature detection. Based on the normalized image, one holistic 
face image of size 165 x 120 pixels (Eig. [^) and six image 
patches of size 100 x 100 pixels (Eig.[^) are sampled. Another 
holistic face image is obtained by 3D pose normalization using 
OpenGL d. Pose variation is reduced in the rendered frontal 
face, as shown in Eig. [^. 

Two CNN models named NNl and NN2 are designed, 
which are closely related to the ones proposed in csi, im, but 
with a number of modifications and improvements. We denote 
the CNN that extracts feature from the holistic face image 
as CNN-Hl. In the following, we take CNN-Hl for example 
to illustrate the architectures of NNl and NN2, as shown 
in Table |T| and Table |I^ respectively. The other seven CNNs 
employ similar structure but with modifications in resolution 
for each layer. The major difference between NNl and NN2 
is that NN2 is both deeper and wider than NNl. With larger 
structure, NN2 is more robust to highly non-linear facial 
appearance variations; therefore, we apply it to CNN-Hl. NNl 
is smaller but more efficient and we apply it to the other seven 
CNNs, with the underlying assumption that the image patches 
and pose normalized face contain less nonlinear appearance 
variations. Compared with NNl, NN2 is more vulnerable to 
overfitting due to its larger number of parameters. In this paper, 
we make use of aggressive data augmentation and multi-stage 
training strategies to reduce overfitting. Details of the two 
strategies are described in the experimentation section. 

NNl contains 10 convolutional layers, 4 max-pooling lay¬ 
ers, 1 mean-pooling layer, and 2 fully-connected layers. In 
comparison, NN2 incorporates 12 convolutional layers. Small 
filters of 3 X 3 are utilized for all convolutional layers. As 
argued in C3, successive convolutions by small filters equal 
to one convolution operation by a large filter, but effectively 
enhances the model’s discriminative power and reduces the 
number of filter parameters to learn. ReLU |[^ activation 
function is utilized after all but the last convolutional layers. 
The removal of ReLU nonlinearity helps to generate dense 
features, as described in CD . We also remove the ReLU non¬ 
linearity after Ec6; therefore the projection of convolutional 
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TABLE I 

Details of the model architecture for NNl 


TABLE II 

Details of the model architecture for NN2 


Name 

Type 

Input 

Size 

Filter 

Number 

Filter Size 

/stride /pad 

With 

Relu 

Convll 

conv 

165x120 

64 

3x3 /I /O 

yes 

Convl2 

conv 

163x118 

128 

3x3 /I /O 

yes 

Pooll 

max pool 

161x116 

N/A 

2x2 /2 /O 

no 

Conv21 

conv 

80x58 

64 

3x3 /I /O 

yes 

Conv22 

conv 

78x56 

128 

3x3 /I /O 

yes 

Pool2 

max pool 

76x54 

N/A 

2x2 /2 /O 

no 

Conv31 

conv 

38x27 

64 

3x3 /I /I 

yes 

Conv32 

conv 

38x27 

128 

3x3 /I /I 

yes 

Pool3 

max pool 

38x27 

N/A 

2x2 /2 /I 

no 

Conv41 

conv 

20x14 

128 

3x3 /I /I 

yes 

Conv42 

conv 

20x14 

256 

3x3 /I /I 

yes 

Pool4 

max pool 

20x14 

N/A 

2x2 /2 /O 

no 

Conv51 

conv 

10x7 

128 

3x3 /I /I 

yes 

Conv52 

conv 

10x7 

256 

3x3 /I /I 

no 

Pool5 

mean pool 

10x7 

N/A 

2x2 /2 /I 

no 

Dropout 

dropout 

6144x1 

N/A 

N/A 

N/A 

Fc6 

fully-conn 

512x1 

N/A 

N/A 

no 

Fc7 

fully-conn 

9000x1 

N/A 

N/A 

no 

Softmax 

softmax 

9000x1 

N/A 

N/A 

N/A 


features by Fc6 layer is from dense to dense, which means 
that Fc6 effectively equals to a linear dimension reduction 
layer that is similar to PCA or Linear Discriminative Analysis 
(LDA). This is different from previous works that favor sparse 
features produced by ReLU lO, |@, lISTIl . Our model is 
also different from im since CD simply removes the linear 
dimension reduction layer (Fc6). The output of the Fc6 layer is 
employed as face representation. In the experimental section, 
we empirically justify that the dense-to-dense projection by 
Fc6 is advantageous to produce more discriminative features. 
The forward function of ReLU is represented as 

R{xi) = max{0, Wj Xi + be), ( 1 ) 

where Xi, Wc, and he are the input, weight, and bias of the 
corresponding convolutional layer before the ReLU activation 
function. R{xi) is the output of the ReLU activation function. 
The dimension of the Fc6 layer is set to 512. The dimension of 
the Fc7 is set to 9000, which equals to the number of training 
subjects employed in this paper. We employ dropout |[32l as a 
regularizer on the first fully-connected layer in the case of 
overfitting caused by the large amount of parameters. The 
dropout ratio is set to 0.4. Since this low-dimensional face 
representation is utilized to distinguish as large as 9,000 
subjects in the training set, it should be very discriminative 
and has good generalization ability. 

The other holistic image is rendered by OpenGL with the 
help of 3D generic face model ca. Pose variation is reduced 
in the rendered image. We denote the CNN that extracts deep 
feature from this image as CNN-H2, as illustrated in Fig. 
Therefore, the first two CNNs encode holistic image features 
from different modalities. The CNNs that extract features from 
the six image patches are denoted as CNN-Pl, CNN-P2, to 


Name 

Type 

Input 

Size 

Filter 

Number 

Filter Size 

/stride /pad 

With 

Relu 

Convll 

conv 

165x120 

64 

3x3 /I /O 

yes 

Conv12 

conv 

163x118 

128 

3x3 /I /O 

yes 

Pooll 

max pool 

161x116 

N/A 

2x2 /2 /O 

no 

Conv21 

conv 

80x58 

64 

3x3 /I /O 

yes 

Conv22 

conv 

78x56 

128 

3x3 /I /O 

yes 

Pool2 

max pool 

76x54 

N/A 

2x2 /2 /O 

no 

Conv31 

conv 

38x27 

128 

3x3 /I /I 

yes 

Conv32 

conv 

38x27 

128 

3x3 /I /I 

yes 

Pool3 

max pool 

38x27 

N/A 

2x2 /2 /I 

no 

Conv41 

conv 

20x14 

256 

3x3 /I /I 

yes 

Conv42 

conv 

20x14 

256 

3x3 /I /I 

yes 

Conv43 

conv 

20x14 

256 

3x3 /I /I 

yes 

Pool4 

max pool 

20x14 

N/A 

2x2 /2 /O 

no 

Conv51 

conv 

10x7 

256 

3x3 /I /I 

yes 

Conv52 

conv 

10x7 

256 

3x3 /I /I 

yes 

Conv53 

conv 

10x7 

256 

3x3 /I /I 

no 

Pool5 

mean pool 

10x7 

N/A 

2x2 /2 /I 

no 

Dropout 

dropout 

6144x1 

N/A 

N/A 

N/A 

Fc6 

fully-conn 

512x1 

N/A 

N/A 

no 

Fc7 

fully-conn 

9000x1 

N/A 

N/A 

no 

Softmax 

softmax 

9000x1 

N/A 

N/A 

N/A 



Fig. 3. The normalized holistic face images and image patches as input for 
MM-DFR. (a) The original holistic face image and the 3D pose normalized 
holistic image; (b) Image patches uniformly sampled from the original face 
image. Due to facial symmetry and the augmentation by horizontal flipping, 
we only leverage the six patches illustrated in the first two columns. 


CNN-P6, respectively, as illustrated in Fig.[^ Exactly the same 
network structure is adopted for each of the six CNNs. 

Different from previous works that randomly sample a 
large number of image patches O, we propose to sample 
a small number of image patches uniformly in the semantic 
meaning, which contributes to maximizing the complementary 
information contained within the sampled patches. However, 
the uniform sampling of the image patches is not easy due to 
the pose variations of the face appeared in real-world images. 
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Fig. 4. The principle of patch sampling adopted in this paper. A set of 3D 
landmarks are uniformly labeled on the 3D face model, and are projected 
to the 2D image. Centering around each landmark, a square patch of size 
100 X 100 pixels is cropped, as illustrated in Fig.[^. 



Fig. 5. More examples about the uniformly detected landmarks that are 
projected from a generic 3D face model to 2D images. 


as shown in Fig. We tackle this problem with a recently 
proposed strategy for pose-invariant face recognition 13^ . The 
principle of the patch sampling process is illustrated in Fig. 

In brief, nine 3D landmarks are manually labeled on a generic 
3D face model and the 3D landmarks spread uniformly across 
the face model. In this paper, we consistently employ the 
mean shape of the Basel Face Model as the generic 3D face 
model El Given a 2D face image, it is first aligned to the 
generic 3D face model using orthogonal projection with the 
help of five facial feature points. Then, the pre-labeled 3D 
landmarks are projected to the 2D image. Lastly, a patch of 
size 100 X 100 pixels is cropped centering around each of 
the projected 2D landmarks. More examples of the detected 
2D uniform landmarks are shown in Fig. It is clear that the 
patches are indeed uniformly sampled in the semantic meaning 
regardless of the pose variations of the face image. 


B. Combination of CNNs using Stacked Auto-Encoder 

We denote the features extracted by the set of CNNs as 
^xk}, where Xi G < i < K. \n this 

paper, K equals to 8 and d equals to 512. The set of features 
represents multimodal information for face recognition. We 
conduct feature-level fusion to obtain a single signature for 
each face image. In detail, the features extracted by the eight 


CNNs are concatenated as a large feature vector, denoted as: 

X = [xi;x2r ” ; xk] ^ ( 2 ) 

X is high dimensional, which is impractical for real-world 
face recognition applications. We further propose to reduce the 
dimension of x by SAE. Compared with the traditional dimen¬ 
sion reduction approaches, e.g., PC A, SAE has advantage in 
learning non-linear feature transformations. In this paper, we 
employ a three-layer SAE. The number of the neurons of the 
three auto-encoders are 2048, 1024, and 512, respectively. The 
output of the last encoder is utilized as the compact signature 
of the face image. The structure for the designed SAE is 
illustrated in Fig. 

Nonlinear activation function is utilized after each of the 
fully-connected layers. Two activation functions, i.e., sigmoid 
function and hyperbolic tangent (tanh) function, are evaluated. 
The forward function of the sigmoid activation function is 
represented as 

^ i^exp(-Wfxi-bf) • (3) 

The forward function of the tanh activation function is repre¬ 
sented as 

rr(^ \ _ exp(wjxi+bf)-exp(-wfXi-bf) 

— exp{Wjxi+bf)+exp{-Wjxi-bf)^ 

where Xi, Wf, and bf are the input, weight, and bias of 
the corresponding fully-connected layer before the activation 
function. Different normalization schemes of x are adopted for 
the sigmoid and tanh activation functions, since their output 
space is different. For the sigmoid function, we normalize the 
elements of x to be within [0,1]. For the tanh function, we 
normalize the elements of x to be within [—1,+!]. In the 
experimentation section, we empirically compare the perfor¬ 
mance of SAE with the two different nonlinearities. 


IV. Each Matching with MM-DER 


In this section, the face matching problem is addressed 
based on the proposed MM-DER framework. Two evaluation 
modes are adopted: the unsupervised mode and the supervised 
mode. Suppose two features produced by MM-DER for two 
images are denoted as yi and y 2 , respectively. In the unsuper¬ 
vised mode, the cosine distance is employed to measure the 
similarity s between yi and y 2 . 


s{yi^y2) 


yiy2 

I|yi|||l2/2ir 


(5) 


Eor the supervised mode, a number of discriminative or 
generative models can be employed (351, EH, (371 . In this 
paper, we employ the Joint Bayesian (JB) model (^ as it is 
shown to outperform other popular models in recent works (a. 
Eor both the unsupervised and supervised modes, the nearest 
neighbor (NN) classifier is adopted for face identification. JB 
models the face generation process as 


X = /i + 5, 


( 6 ) 


where y represents the identity of the subject, while 5 repre¬ 
sents intra-personal noise. 
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JB solves the face identification or verification problems 
by computing the log-likelihood ratio between the probability 
P{xi^X 2 \Hi) that two faces belong to the same subject 
and the probability P{xi^X 2 \He) that two faces belong to 
different subjects. 


r{xi,X 2 ) = log 


P{xi,X2\Hi) 

P{x^,X2\He)' 


(7) 


where r{xi,X 2 ) represents the log-likelihood ratio, and we 
refer to r(xi,X 2 ) as similarity score for clarity in the experi¬ 
mental part of the paper. 


V. EXPERIMENTAL EVALUATION 

In this section, extensive experiments are conducted to 
present the effectiveness of the proposed MM-DER frame¬ 
work. The experiments are conducted on two large-scale 
unconstrained face databases, i.e., LEW lITOl and CASIA- 
WebEace im. Images in both databases are collected from 
internet; therefore they are real images that appear in multi- 
media circumstances. 

The LEW ITOl database contains 13,233 images of 5,749 
subjects. Images in this database exhibit rich intra-personal 
variations of pose, illumination, and expression. It has been 
extensively studied for the research of unconstrained face 
recognition in recent years. Images in LEW are organized into 
two “Views”. View 1 is for model selection and parameter 
tuning while View 2 is for performance reporting. In this paper, 
we follow the official protocol of LEW and report the mean 
verification accuracy and the standard error of the mean (Se) 
by the 10-fold cross-validation scheme on the View 2 data. 

Despite of its popularity, the LEW database contains limited 
number of images and subjects, which restricts its evaluation 
for large-scale unconstrained face recognition applications. 
The CASIA-WebFace El database has been released recently. 
CASIA-WebEace contains 494,414 images of 10,575 subjects. 
As images in this database are collected in a semi-automatic 
way, there is a small amount of mis-labeled images in this 
database. Because there is no officially defined protocol for 
face recognition on this database, we define our own protocol 
for face identification in this paper. In brief, we divide CASIA- 
WebEace into two sets: a training set and a testing set. The 
10,575 subjects are ranked in the descent order by the number 
of their images contained in the database. The 471,592 images 
of the top 9,000 subjects compose the training set. The 22,822 
images of the rest 1,575 subjects make up the testing set. 

All CNNs and SAE in this paper are trained using the 9,000 
subjects in the defined training set above. Images are converted 
to gray-scale and geometrically normalized as described in 
Section III. Eor NNl, we double the size of the training set by 
flipping all training images horizontally to reduce overfitting. 
Therefore, the size of training data for NNl is 943,184. Eor 
NN2, we adopt much more aggressive data augmentation 
by horizontal flipping, image jittering and image down- 
sampling. The size of the augmented training data for NN2 

^For image jittering, we add random gaussian noise on the coordinates of 
the five facial feature points. The noise is distributed with zero mean and 
standard deviation of four pixels. 



Fig. 6. Training data distribution for NNl and NN2. This figure plots the 
number of images for each subject in the training set. The long-tail distribution 
characteristic da of the original training data is improved after the aggressive 
data augmentation for NN2. 


is about 1.8 million. The distribution of training data for NNl 
and NN2 is illustrated in Eig. It is shown that the long-tail 
distribution characteristic ca of the original training data is 
improved after the aggressive data augmentation for NN2. 

We adopt the following multi-stage training strategy to 
train all the CNN models. Eirst, we train the CNN models 
as a multi-class classification problem, i.e., softmax loss is 
employed. Eor all CNNs, the initial learning rate for all 
learning layers is set to 0.01, and is divided by 10 after 
10 epochs, to the final rate of 0.001. Second, we adopt the 
recently proposed triplet loss 13811 for fine-tuning for 2 more 
epochs. We set the margin for the triplet loss to be 0.2 and 
learning rate to be 0.001. It is expected that this multi-stage 
training strategy can boost performance while converge faster 
than using the triplet loss alone l3^ . Eor SAE, the learning 
rate decreases from 0.01 to 0.00001, gradually. We train each 
of the three auto-encoders one by one and each auto-encoder 
is trained for 10 epochs. In the testing phase, we extract 
deep feature from both the original image and its horizontally 
flipped image. Unless otherwise specified, the two feature 
vectors are averaged as the representation of the input face 
image. The open-source deep learning toolkit Caffe l39l is 
utilized to train all the deep models. 

Live sets of experiments are conducted. Eirst, we empirically 
justify the advantage of dense features for face recognition by 
excluding two ReLU nonlinearities compared with previous 
works. The performance of the proposed single CNN model 
is also compared against the state-of-the-art CNN models on 
the LEW database. Next, the performance of the eight CNNs 
contained within the MM-DER framework is compared on face 
verification task on LEW. Then, the fusion of the eight CNNs 
by SAE is conducted and different nonlinearities are also 
compared. We also test the performance of MM-DER followed 
with the supervised classifier JB. Lastly, face identification 
experiment is conducted on the CASIA-WebEace database 
with our own defined evaluation protocol. 
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Fig. 7. Performance comparison on LFW with different usage strategies of 
ReLU nonlinearity. 


A. Performance Comparison with Single CNN Model 

In this experiment, we evaluate the role of ReLU non¬ 
linearity using CNN-Hl as an example. For fast evaluation, 
the comparison is conducted with the simple NNl structure 
described in Table and only the softmax loss is employed 
for model training. Performance of CNN-Hl using the NN2 
structure can be found in Table Vf_ Two paradigm^ are 
followed: 1) the unsupervised paradigm that directly calculate 
the similarity between two CNN features using cosine distance 
metric. 2) the supervised paradigm that uses JB to calculate 
the similarity between two CNN features. For the supervised 
paradigm, we concatenate the CNN features of the original 
face image and its horizontally flipped version as the raw 
representation of each test sample. Then, we adopt PCA for 
dimension reduction and JB for similarity calculation. The 
dimension of the PCA subspace is tuned on the View 1 data 
of LFW and applied to the View 2 data. Both PCA and JB are 
trained on the CASIA-WebFace database. For PCA, to boost 
performance, we also re-evaluate the mean of CNN features 
using the 9 training folds of LFW in 10-fold cross validation. 

The performance of three structures are reported in Fig. |7] 
and Fig. [8] 1) NNl, 2) NNl with ReLU after Conv52 layer 
(denoted as NN1 -fC 52R), and 3) NNl with ReLU after both 
Conv52 and Fc6 (denoted as NNl-i-C52R-i-Fc6R). For both 
NN1-FC52R and NN1 -fC52R-fFc 6R, we replace the average 
pooling layer after Conv 52 with max pooling accordingly. It 
is shown in Fig. that the ReLU nonlinearity after Conv52 or 
Fc6 actually harms the performance of CNN. The experimental 
results have two implications: 1) dense feature is preferable 
than sparse feature for CNN, as intuitively advocated in ifTTI . 
However, there is no experimental justification in ifTTIl . 2) the 
linear projection from the output of the ultimate convolutional 
layer (Conv52) to the low-dimensional subspace (Fc6) is 
better than the commonly adopted non-linear projection. This 
is clear evidence that the negative response of the ultimate 
convolutional layer (Conv52) also contains useful information. 

The performance by single CNN models on LFW is reported 
in Table. [Tin The performance of the state-of-the-art CNN 


^Similar to previous works m, im, both the two paradigms dehned in 
this paper correspond to the “Unrestricted, Labeled Outside Data Results” 
protocol that is officially dehned in (H. 


TABLE III 

Performance Comparison on LFW using Single CNN Model on 
Holistic Face Image 



Accuracy (Unsupervised) 

Accuracy (Supervised) 

DeepFace (8) 

95.92 ± 0.29 

97.00 ± 0.87 

DeepID2 d 

- 

96.33 ± - 

Arxiv2014 QT] 

96.13 ± 0.30 

97.73 ± 0.31 

Facebook 

- 

98.00 ± - 

MSU TR (40l 

96.95 ± 1.02 

97.45 ± 0.99 

Ours (NNl) 

97.32 ± 0.34 

98.05 ± 0.22 

Ours (NN2) 

98.12 ± 0.24 

98.43 ± 0.20 



Fig. 8. ROC curves of different usage strategies of the ReLU nonlinearity 
on LFW. 


models is also tabulated. Compared with Fig. [7] we further im¬ 
prove the performance of NNl by fine-tuning with triplet loss. 
It seems that the triplet loss mainly improves the performance 
for the unsupervised mode in our experiment. It is shown that 
the proposed CNN model consistently outperforms the state- 
of-the-art CNN models under both the unsupervised paradigm 
and supervised paradigm. In particular, compared with im, 
ll40l that all employ the complete CASIA-WebFace database 
for CNN training, we only leverage a subset of the CASIA- 
WebFace database. With more training data, we expect the 
proposed CNN model can outperform the other models with 
an even larger margin. 


B. Performance of the Eight CNNs in MM-DFR 


In this experiment, we present in Table IV the performance 
achieved by each of the eight CNNs contained within the MM- 
DFR framework. We report the performance of CNN-Hl with 
the NN2 structure while the other seven CNNs all employ 
the more efficient NNl structure. The same as the previous 
experiment, both the unsupervised paradigm and supervised 
paradigm are followed. For the supervised paradigm, the PCA 
subspace dimension of the eight CNNs is unified to be 110. 
Besides, features of the original face image and the horizon¬ 
tally flipped version are L2 normalized before concatenation. 
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TABLE IV 

Performance Comparison on LEW of Eight Individual CNNs 



Accuracy (Unsupervised) 

Accuracy (Supervised) 

CNN-Hl 

98.12 ± 0.24 

98.43 ± 0.20 

CNN-H2 

96.47 ± 0.44 

97.67 ± 0.28 

CNN-Pl 

96.83 ± 0.26 

97.30 ± 0.22 

CNN-P2 

97.25 ± 0.31 

98.00 ± 0.24 

CNN-P3 

96.70 ± 0.25 

97.82 ± 0.16 

CNN-P4 

96.17 ± 0.31 

96.93 ± 0.21 

CNN-P5 

96.05 ± 0.27 

97.23 ± 0.20 

CNN-P6 

95.58 ± 0.17 

96.72 ± 0.21 


We find that this normalization operation typically boosts the 
performance of the supervised paradigm by 0.1% to 0.4%. 

When combining Table |I^ and Table it is clear that 
CNN-Hl outperforms CNN-H2 with the same NNl struc¬ 
ture, although they both extract features from holistic face 
images. This maybe counter-intuitive, since the impact of 
pose variation has been reduced for CNN-H2. We explain 
this phenomenon from the following two aspects: 1) most 
images in LFW are near-frontal face images, so the 3D pose 
normalization employed by CNN-H2 does not contribute much 
to pose correction. 2) the errors in pose normalization bring 
about undesirable distortions and artifacts to facial texture, 
e.g., the distorted eyes, nose, and mouth shown in Fig. |^a). 
The distorted facial texture is adverse to face recognition, as 
argued in our previous work (T]. However, we empirically 
observe that the performance of MM-DFR drops slightly on 
View 1 data if we exclude CNN-H2, which indicates CNN- 
H2 provides complementary information to CNN-Hl from a 
novel modality. The contribution of CNN-H2 to MM-DFR is 
also justified by the last experiment in this section. Besides, 
the performance of the patch-level CNNs, i.e., CNN-Pl to 
CNN-P6, fluctuates according to the discriminative power of 
the corresponding patches. 

C. Fusion of CNNs with SAE 

In this experiment, we empirically choose the best non¬ 
linearity for SAE that is employed for feature-level fusion 
of the eight CNNs. The structure of SAE employed in this 
paper is described in Fig. For each CNN, we average the 
features of the original image and the horizontally flipped 
version. L2 normalization is conducted for each averaged 
feature before concatenating the features produced by the 
eight CNNs. Similar to the previous experiment, we find this 
normalization operation promotes the performance of MM- 
DFR. The dimension of the input for SAE is 4,096. Two 
types of non-linearities are evaluated, the sigmoid non-linearity 
and the tanh non-linearity, denoted as SAE-SIG and SAE- 
TANH, respectively. The output of the third encoder (before 
the nonlinear layer) is utilized as the signature of the face 
image. Cosine distance is employed to evaluate the similarity 
between two face images. SAE are trained on the training 
set of CASIA-WebEace, using feature vectors extracted from 
both the original images and the horizontally Hipped ones. 
The performance of SAE-SIG and SAE-TANH is 98.33% and 



Fig. 9. Performance comparison between the proposed MM-DFR approach 
and single modality-based CNN on the face verification task. 

97.90% on the Viewl data of LEW, respectively. 

SAE-TANH considerably outperforms SAE-SIG. One im¬ 
portant difference between the sigmoid non-linearity and the 
tanh non-linearity is that they normalize the elements of the 
feature to be within [0,1] and [—1,1], respectively. Compared 
with the tanh non-linearity, the sigmoid non-linearity loses 
the sign information of feature elements. However, the sign 
information is valuable for discriminative power. 

D. Performance of MM-DFR with Joint Bayesian 

The above three experiments have justified the advantage of 
the proposed CNN structures. In this experiment, we further 
promote the performance of the proposed framework. 

We show the performance of MM-DER with JB, where the 
output of MM-DER is utilized as the signature of the face 
image. We term this face recognition pipeline as MM-DER-JB. 
Eor comparison, the performance achieved by CNN-Hl with 
the JB classifier is also presented, denoted as “CNN-Hl -f JB”. 
The performance of the two systems is tabulated in Table |V] 
and the ROC curves are illustrated in Eig. It is shown 
that MM-DER considerably outperforms the single modal- 
based approach, which indicates the fusion of multimodal 
information is important to promote the performance of face 
recognition systems. By excluding the five labeling errors in 
LEW, the actual performance of MM-DER-JB reaches 99.10%. 

Our simple 8-net based ensemble system also outperforms 
DeepID2 (91, which includes as much as 25 CNNs. Some more 
recent approaches that were published after the submission of 
this paper, e.g. (381, ED, achieve better performance than 
MM-DER. However, they either employ significantly larger 
private training dataset or considerably larger number of CNN 
models. In comparison, we employ only 8 nets and train the 
models using a relatively small training set. 

E. Face Identification on CASIA-WebFace Database 

The face identification experiment is conducted on the test 
data of the CASIA-WebEace database, which includes 22,822 
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TABLE V 

Performance Evaluation of MM-DFR with JB 



#Nets 

Accuracy (% )3zSe 

CNN-Hl -h JB 

1 

98.43 ± 0.20 

DeepFace [S] 

7 

97.25 ± 0.81 

MSU TR (40) 

7 

98.23 ± 0.68 

DeepID2 (U 

25 

98.97 ± 0.25 

MM-DFR-JB 

8 

99.02 ± 0.19 


TABLE VI 

The rank-1 identification rates by Different Combinations of 
Modalities on CASIA-WebFace Database 



Identification Rates 

CNN-Hl + JB 

72.26% 

CNN-H2 + JB 

69.07% 

CNN-Hl &H2 + JB 

74.51% 

CNN-Pl to P6 + JB 

76.01% 

MM-DFR-JB 

76.53% 



Fig. 10. CMS curves by different combinations of modalities on the face 
identification task. 


images of 1,575 subjects. For each subject, the first five images 
are selected to make up the gallery set, which can generally be 
satisfied in many multimedia applications, e.g., social networks 
where each subject has multiple face images. All the other 
images compose the probe set. Therefore, there are 7,875 
gallery images and 14,947 probe images in total. 

The rank-1 identification rates by different combinations 
of modalities are tabulated in Table |Vl| The corresponding 
Cumulative Match Score (CMS) curves are illustrated in 
Fig. It is shown that although very high face verification 
rate has been achieved on the LFW database, large-scale face 
identification in real-world applications is still a very hard 
problem. In particular, the rank-1 identification rate by the 
proposed approach is only 76.53%. 

It is clear that the proposed multimodal face recognition 
algorithm significantly outperforms the single modal based 
approach. In particular, the rank-1 identification rate of MM- 
DFR-JB is higher than that of “CNN-Hl -f JB” by as much as 
4.27%. “CNN-Hl -f JB” outperforms “CNN-H2 -f JB” with 
a large margin, partially because CNN-Hl is based on the 
larger architecture NN2 and trained with more aggressively 
augmented data. However, the combination of the two modal¬ 
ities still considerably boosts the performance by 2.25% on the 
basis of CNN-Hl, which forcefully justifies the contribution of 
the new modality introduced by 3D pose normalization. These 
experimental results are consistent with those obversed on the 
LFW database. Experimental results on both datasets strongly 
justify the effectiveness of the proposed MM-DFR framework 
for multimedia applications. 

VI. Conclusion 

Face recognition in multimedia applications is a challeng¬ 
ing task because of the rich appearance change caused by 
pose, expression, and illumination variations. We handle this 
problem by elaborately designing a deep architecture that 
employs complementary information from multimodal image 
data. First, we enhance the recognition ability of each CNN 


by carefully integrating a number of published or our own 
developed tricks, such as deep structures, small filters, care¬ 
ful use of ReLU nonlinearity, aggressive data augmentation, 
dropout, and multi-stage training with multiple losses, L2 
normalization. Second, we propose to extract multimodal 
information using a set of CNNs from the original holistic 
face image, the rendered frontal pose image by 3D model, 
and uniformly sampled image patches. Third, we present the 
feature-level fusion approach using stacked auto-encoders to 
fuse the features extracted by the set of CNNs, which is ad¬ 
vantageous to learn non-linear dimension reduction. Extensive 
experiments have been conducted for both face verification 
and face identification experiments. As the proposed MM-DFR 
approach effectively employs multimodal information for face 
recognition, clear advantage of MM-DFR is shown compared 
with the single modal-based algorithms and some state-of-the- 
art deep models. Other deep learning based approaches may 
also benefit from the structures that have been proved to be 
useful in this paper. In the future, we will try to integrate more 
multimodal information into the MM-DFR framework and 
further promote the performance of single deep architecture 
such as NN2. 
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